Abstract

We prove results on the relative representational power of mixtures of
product distributions and restricted Boltzmann machines (products of
mixtures of pairs of product distributions). Tools of independent interest are
mode-based polyhedral approximations sensitive enough to compare
full-dimensional models, and characterizations of possible modes and support
sets of probability distributions represented by both model classes. We
find, in particular, that an exponentially larger mixture model, with
an exponentially larger number of parameters, is needed to represent
probability distributions that can be represented by the restricted Boltzmann
machines. The title question is intimately related to questions in coding
theory and point configurations in hyperplane arrangements.
1 Introduction

Fixing the number of parameters, are restricted Boltzmann machines (RBMs) [30], [10], [14], [15]
better than mixtures of product distributions (MoPs) at approximating interesting or
complex probability distributions? We use the number of modes of probability distributions
as a measure of complexity and solve the following:

Problem 1. Given some $n, m \in \mathbb{N}$, what is the smallest $k \in \mathbb{N}$ for which every
probability distribution from the RBM model with $n$ visible and $m$ hidden binary units can be
represented as a mixture of $k$ binary product distributions?

We find that the number of parameters of the smallest MoP model containing an RBM
model grows exponentially in the number of parameters of the RBM for any fixed ratio
$0 < m/n < \infty$. Theorem 29 gives a solution of order $\log_2(k) = \Theta(\min\{m,n\})$, see Figure 1.
Probability distributions with many strong modes, local maxima on the set of inputs $\{0,1\}^n$
equipped with the Hamming distance, can be represented more compactly by RBMs than
by MoPs.

Section 2 gives some background on mixtures of product distributions and products of
mixtures. Section 3 analyzes the sets of modes realizable by mixtures of product distributions.
Section 4 analyzes the input space partitions and sets of modes that can be realized by
RBMs, relating them to linear threshold functions, hyperplane arrangements, and
zonotopes. Theorems 5, 14, and 23 characterize strong modes of probability distributions in
MoPs and RBMs. Section 5 contains our analysis of Problem 1. Section 6 discusses the
complementary problem to the title question: When does an RBM contain a mixture of
products? Theorem 32 shows that in general a MoP model is not contained in an RBM
model of the same dimension. We defer some proofs to Appendix A. In Appendix B we
describe the relationship between the sets of modes of RBMs and the multi-covering numbers
of hypercubes, and compute a few examples.

Figure 1: Heat map of $\log_2(k)$ depending on $m, n \in \mathbb{N}$, where
$k = \min\{k' : \mathcal{M}_{n,k'} \supseteq \mathrm{RBM}_{n,m}\}$. The domain of this function has three regions, each with
approximately linear behavior (Theorem 29). An RBM of dimension $c$ has hyperparameters from the set
$\{(n,m) : nm + n + m = c\}$ (the dashed hyperbola labeled "$\dim(\mathrm{RBM}_{n,m}) = c$"). Fixing
dimension, the RBMs which are hardest to represent as mixtures of product distributions
are those where $m/n \approx 1$.

2 Preliminaries 

2.1 Mixtures of products and products of mixtures 

Let $\mathcal{P}_n$ denote the $(2^n - 1)$-dimensional simplex of probability distributions on $\{0,1\}^n$.
The set of product distributions $p(x_1, \ldots, x_n) = p_1(x_1) \cdots p_n(x_n)$, $(x_1, \ldots, x_n) \in \{0,1\}^n$, is
denoted by $\mathcal{M}_{n,1}$. This set is the closure of the $n$-dimensional exponential family

$$p(x) = \frac{1}{Z} \exp\Big(\sum_{i \in [n]} B_i x_i\Big) \quad \text{for all } x \in \{0,1\}^n,$$

with natural parameter $B \in \mathbb{R}^n$ and partition function $Z = \sum_{y \in \{0,1\}^n} \exp\big(\sum_{i \in [n]} B_i y_i\big)$.
The $k$-mixture of product distributions $\mathcal{M}_{n,k}$ is the set of distributions in $\mathcal{P}_n$ expressible as
convex combinations $p = \sum_{i \in [k]} \lambda_i q^{(i)}$ of $k$ product distributions $q^{(i)} \in \mathcal{M}_{n,1}$ with mixture
weights $\lambda_i \ge 0$, $\sum_{i \in [k]} \lambda_i = 1$.

The RBM model with $n$ visible and $m$ hidden binary units, $\mathrm{RBM}_{n,m}$, is the set of
distributions on $\{0,1\}^n$ that can be approximated arbitrarily well by distributions of the form

$$p(x) = \frac{1}{Z} \sum_{h \in \{0,1\}^m} \exp(hWx + Bx + Ch) \quad \text{for all } x \in \{0,1\}^n,$$

where $W \in \mathbb{R}^{m \times n}$ is the matrix of interaction weights between the hidden and visible units,
$B \in \mathbb{R}^n$ is the vector of bias weights of the visible units, $C \in \mathbb{R}^m$ is the vector of bias weights
of the hidden units, and $Z = \sum_{x \in \{0,1\}^n} \sum_{h \in \{0,1\}^m} \exp(hWx + Bx + Ch)$ is the partition
function.

The models $\mathcal{M}_{n,k}$ and $\mathrm{RBM}_{n,m}$ have graphical representations with hidden variables as
shown in Figure 2.

Figure 2: Graphical representations of the model $\mathcal{M}_{6,k}$, consisting of mixtures of $k$ product
distributions from $\mathcal{M}_{6,1}$ (Mixture of Products), and of the model $\mathrm{RBM}_{6,4}$, consisting of
products of four mixture distributions from $\mathcal{M}_{6,2}$ (Product of Mixtures).



The set $\mathcal{M}_{n,k}$ has the dimension expected from counting parameters, $\min\{nk + (k-1), 2^n - 1\}$,
unless $n = 4$ and $k = 3$, when it has dimension 13 instead of 14. The dimension
of $\mathrm{RBM}_{n,m}$ is known to be equal to the number of parameters, $nm + n + m$, when
$m \le 2^{n - \lceil \log_2(n+1) \rceil}$, and equal to $2^n - 1$ when $m \ge 2^{n - \lfloor \log_2(n+1) \rfloor}$, see Figure 3.

Figure 3: The expected dimension $\min\{nm + n + m, 2^n - 1\}$ of the models $\mathrm{RBM}_{n,m}$ and $\mathcal{M}_{n,m+1}$
(horizontal axis: $n$).

The RBM model is a product of experts [13] with one expert per hidden unit. Each
expert is a mixture of two product distributions, see [6], and hence the RBM model
is a (Hadamard) product of mixtures. In a complementary way, we may view the
distributions $p \in \mathrm{RBM}_{n,m}$ as restricted mixtures of $2^m$ product distributions
$p(x|h) = \exp((hW + B)x) / \sum_{y \in \{0,1\}^n} \exp((hW + B)y) \in \mathcal{M}_{n,1}$ for $h \in \{0,1\}^m$. In general the
dimension of the mixture model $\mathcal{M}_{n,2^m}$ is much larger than that of $\mathrm{RBM}_{n,m}$.
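
The two views of an RBM distribution described above can be compared numerically. The following Python sketch is only an illustration under stated assumptions: the function names (`rbm_marginal`, `rbm_as_mixture`) and the brute-force enumeration are ours, not the paper's, and the approach is feasible only for very small n and m. It computes the visible distribution once from the defining sum over hidden states and once as a mixture of the 2^m product distributions p(x|h) with weights p(h).

```python
import itertools
import math


def binary_vectors(n):
    """All vectors in {0,1}^n as tuples."""
    return list(itertools.product((0, 1), repeat=n))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rbm_marginal(W, B, C):
    """Visible distribution p(x) = (1/Z) * sum_h exp(hWx + Bx + Ch)."""
    m, n = len(W), len(W[0])
    unnorm = {}
    for x in binary_vectors(n):
        Wx = [dot(row, x) for row in W]  # the vector Wx in R^m
        unnorm[x] = sum(
            math.exp(dot(h, Wx) + dot(B, x) + dot(C, h))
            for h in binary_vectors(m)
        )
    Z = sum(unnorm.values())
    return {x: u / Z for x, u in unnorm.items()}


def rbm_as_mixture(W, B, C):
    """The same distribution as a mixture of 2^m product distributions p(x|h)."""
    m, n = len(W), len(W[0])
    xs = binary_vectors(n)
    weights, components = [], []
    for h in binary_vectors(m):
        # Natural parameter hW + B of the product distribution p(.|h).
        theta = [sum(h[j] * W[j][i] for j in range(m)) + B[i] for i in range(n)]
        Zh = sum(math.exp(dot(theta, y)) for y in xs)
        components.append({x: math.exp(dot(theta, x)) / Zh for x in xs})
        weights.append(math.exp(dot(C, h)) * Zh)  # unnormalized p(h)
    total = sum(weights)
    return {
        x: sum(w / total * q[x] for w, q in zip(weights, components))
        for x in xs
    }
```

For any small choice of W, B, C the two dictionaries should agree up to floating-point error, which reflects the inclusion of the RBM model in the mixture model with 2^m components.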

The model $\mathcal{M}_{n,k}$ is equal to $\mathcal{P}_n$ if and only if $k \ge 2^{n-1}$ [18], and hence the smallest MoP
universal approximator of distributions on $\{0,1\}^n$ has $2^{n-1}(n+1) - 1$ parameters. The
model $\mathrm{RBM}_{n,m}$ equals $\mathcal{P}_n$ whenever $m \ge 2^{n-1} - 1$ [15]. It is not known whether this bound
is tight, but it shows that the smallest RBM universal approximator of distributions on
$\{0,1\}^n$ has at most $2^{n-1}(n+1) - 1$ parameters, and hence not more than a MoP universal
approximator.

Since we are arguing that the sets of probability distributions representable by RBMs and
MoPs are quite different, it is interesting to consider what is known about when these two
sets do intersect. In [20] it is shown that $\mathrm{RBM}_{n,m}$ and $\mathcal{M}_{n,m+1}$ intersect at sets of dimension
of order $(m+1)n + (m+1) + n - (m-1)\log_2(m+1)$. In Section 6 we show that when
$2m < n$ and $(n,m) \ne (4,3)$ these two models intersect at sets of dimension strictly less than
$nm + n + m$.
Figure 4: Inference regions of $\mathcal{M}_{2,4}$ (left), and of $\mathrm{RBM}_{2,3}$ (right), for a choice of parameters.

2.2 Inference regions and distributed representations 

We comment briefly on the inference regions of MoPs and RBMs and the widespread intu- 
ition that products of experts are more powerful than mixtures of experts. 

Hinton [13] discusses the advantages of products of experts over mixtures of experts, such
as MoPs, for modeling "high-dimensional data which simultaneously satisfies many
low-dimensional constraints". In these models each expert can individually ensure that one
constraint is satisfied. This is related to the notion of distributed representations discussed
by Bengio in [2, Section 5.3]. Each RBM hidden unit linearly divides the input space
according to its preferred state given the input, which results in a multi-clustering, or a
partition of the input space into cells where different joint hidden states are most likely.

An inference function of a probabilistic model $p_\theta(v, h)$ with parameter $\theta$ 'explains' each
value of $v$ by the most likely value of $h$, via $\pi_\theta : v \mapsto \operatorname{argmax}_h p_\theta(h|v)$. This defines a partition
of the input space, where $v$ lives, into the preimages of all possible outputs, called inference
regions.

For each choice of parameters, the model $\mathrm{RBM}_{n,m}$ defines an inference function

$$\pi_{W,B,C} : \mathbb{R}^n \supseteq \{0,1\}^n \to \{0,1\}^m ; \quad v \mapsto \operatorname{argmax}_{h \in \{0,1\}^m} h(Wv + C). \qquad (1)$$

The visible state $v$ is explained by the hidden state $h$ for which $\operatorname{sgn}(Wv + C) = \operatorname{sgn}(h - \tfrac{1}{2}(1, \ldots, 1))$.
The input space $\mathbb{R}^n$ is partitioned into the preimages of the orthants of $\mathbb{R}^m$
by the affine map $z : \mathbb{R}^n \to \mathbb{R}^m ; v \mapsto Wv + C$. The number of inference regions can be
as large as $\mathfrak{C}_{\mathrm{aff}}(m,d) = \sum_{i=0}^{d} \binom{m}{i}$, $d = \min\{n,m\}$, which is the number of orthants of $\mathbb{R}^m$
intersected by a generic $d$-dimensional affine subspace. When the rank of $W$ is less than $m$
(e.g., when $m > n$), the image of the map $z$ does not intersect all orthants of $\mathbb{R}^m$ and there
are 'empty' inference regions.
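
The count of inference regions can be illustrated for n = 2 visible and m = 3 hidden units. The sketch below is a hedged example: the parameter values `W` and `C`, the sampling grid, and the helper names are our own choices for illustration and are not the ones behind Figure 4. It evaluates the RBM inference function on a grid in the plane and collects the hidden states that occur.

```python
import math


def c_aff(m, d):
    """Number of orthants of R^m met by a generic d-dimensional affine subspace."""
    return sum(math.comb(m, i) for i in range(d + 1))


def rbm_inference(W, C, v):
    """Most likely hidden state for input v: h_j = 1 iff (Wv + C)_j > 0."""
    return tuple(
        int(sum(w * x for w, x in zip(row, v)) + c > 0)
        for row, c in zip(W, C)
    )


# Hypothetical parameters: three lines in general position in the plane.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [0.0, 0.0, -0.9]

# Grid chosen so that no sample point lies on any of the three lines.
grid = [-2.75 + 0.5 * k for k in range(12)]
regions = {rbm_inference(W, C, (x, y)) for x in grid for y in grid}
```

With these parameters the grid meets seven distinct inference regions, matching `c_aff(3, 2)`. The hidden state (0, 0, 1) never occurs: it is an 'empty' inference region, since the image of the map v -> Wv + C is only 2-dimensional.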

For any choice of mixture weights $\lambda_i$ and natural parameters $B^{(i)} \in \mathbb{R}^n$ for $i \in [k]$, the
mixture model $\mathcal{M}_{n,k}$ defines an inference function

$$\pi_{\lambda,B} : v \mapsto \operatorname{argmax}_{i \in [k]} \big(B^{(i)} v - \log Z^{(i)}\big), \quad Z^{(i)} = \sum_{v \in \{0,1\}^n} \exp(B^{(i)} v). \qquad (2)$$

The input space is partitioned into the at most $k$ regions of linearity of the function
$v \mapsto \max\{B^{(i)} v - \log Z^{(i)} : i \in [k]\}$.

Figure 4 shows an example of the partitions of $\{0,1\}^2 \subset \mathbb{R}^2$ defined by the models $\mathcal{M}_{2,4}$
and $\mathrm{RBM}_{2,3}$ for some specific parameter values. Both models have 11 parameters and are
universal approximators of distributions on $\{0,1\}^2$, but they define very different inference
regions. For a fixed input space dimension $n$, the number of inference regions in $\mathbb{R}^n$ that can
be realized by $\mathrm{RBM}_{n,m}$ is of order $\Theta\big(\binom{m}{\min\{n,m\}}\big)$, exponential in the number of parameters,
whereas the number of inference regions of MoPs is linear in the number of parameters.
Distributed representations can, in principle, learn explanations to a number of observations
that is exponential in the number of model parameters and training examples [2].
Figure 5: This figure shows the interior of the 3-dimensional probability simplex (a
tetrahedron with vertices corresponding to the outcomes (0,0), (0,1), (1,0), (1,1)), with three sets
of probability distributions depicted. The curved set is the 2-dimensional manifold of
product distributions of two binary variables. The angular regions at the top and bottom
are the polyhedra $\mathcal{G}_2^+$ and $\mathcal{G}_2^-$.



2.3 Modes 

We use the number of local maxima (modes) of a probability distribution as a proxy for the
distribution's complexity. We will characterize the ability of RBMs and mixtures of products
(naive Bayes models) to represent distributions which are complex in this sense, in order to
draw a distinction between them. In particular, we will ask what is the smallest $k \in \mathbb{N}$ for
which a model contains a distribution with $l$ strong modes. Corresponding questions have
been posed for mixtures of multivariate normal distributions [26]. The maximal number of
modes realizable by mixtures of $k$ normal distributions on $\mathbb{R}^n$ is unknown.

Let $\mathcal{X}$ be a set of strings of length $n$ and let $x, y \in \mathcal{X}$. We denote by
$d_H(x,y) := |\{i \in [n] : x_i \ne y_i\}|$ the Hamming distance between $x$ and $y$.

Definition 2. Let $p$ be a probability distribution on $\mathcal{X}$. We call $x \in \mathcal{X}$ a mode of $p$ if
$p(x) > p(y)$ for all $y \in \mathcal{X}$ with $d_H(y,x) = 1$, and a strong mode if $p(x) > \sum_{y \in \mathcal{X} : d_H(y,x) = 1} p(y)$.
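
Definition 2 translates directly into code. The following is a minimal sketch for the binary case only, assuming a distribution is stored as a dictionary from tuples in {0,1}^n to probabilities; the function names are ours and purely illustrative.

```python
def neighbors(x):
    """Binary strings at Hamming distance one from x."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]


def modes(p):
    """Strings whose probability strictly exceeds that of every neighbor."""
    return {x for x in p if all(p[x] > p[y] for y in neighbors(x))}


def strong_modes(p):
    """Strings whose probability strictly exceeds the sum over all neighbors."""
    return {x for x in p if p[x] > sum(p[y] for y in neighbors(x))}


# A distribution on {0,1}^2 with two modes, neither of which is strong.
p = {(0, 0): 0.3, (1, 1): 0.3, (0, 1): 0.2, (1, 0): 0.2}
```

Every strong mode is a mode, but not conversely: in the example, (0,0) and (1,1) are modes, while neither is a strong mode because 0.3 does not exceed 0.2 + 0.2.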

We write $\mathcal{G}_{n,m}$ (and $\mathcal{H}_{n,m}$) for the set of distributions in $\mathcal{P}_n$ which have at least $m$ modes
(strong modes). For any $C \subseteq \{0,1\}^n$, we write $\mathcal{G}_C$ (and $\mathcal{H}_C$) for the set of distributions which
have modes (strong modes) $C$. The closures $\overline{\mathcal{G}_C}$ (and $\overline{\mathcal{H}_C}$) are convex polytopes inscribed in
the probability simplex $\mathcal{P}_n$.

The modes of a distribution encode events that are Hamming-locally most likely in the space
of possible events. The sets of modes of probability distributions are closely related to the
support sets of statistical models. The latter have been studied especially for hierarchical
and graphical models without hidden variables [12], [16], [25]. The sets of modes that are not
realizable by a statistical model give a full-dimensional polyhedral approximation of the
model's complement.

We will study the inference regions of RBMs and MoPs and relate them to the sets of modes 
and support sets of distributions that can be represented by these models. We focus most 
of our consideration on strong modes, which are easier to study than modes; in particular 
because they are described by fewer inequalities. 

The minimum Hamming distance of a set $C \subseteq \mathcal{X}$ is defined as
$d_H(C) := \min\{d_H(x,y) : x \ne y \text{ and } x, y \in C\}$. Since any two modes have at least Hamming distance two from each other,
a distribution on $\{0,1\}^n$ has at most $2^{n-1}$ modes. There are exactly two subsets of $\{0,1\}^n$
with cardinality $2^{n-1}$ and minimum distance two. These are the sets of binary strings with
an even, respectively odd, number of entries equal to one,

$$Z_{+,n} := \{(x_1, \ldots, x_n) \in \{0,1\}^n : \textstyle\sum_i x_i \text{ is even}\} \quad \text{and} \quad Z_{-,n} := \{0,1\}^n \setminus Z_{+,n}. \qquad (3)$$

Hence $\mathcal{G}_{n,2^{n-1}} = \mathcal{G}_{Z_{+,n}} \cup \mathcal{G}_{Z_{-,n}}$, or for short $\mathcal{G}_n = \mathcal{G}_n^+ \cup \mathcal{G}_n^-$. We define $\mathcal{H}_n$ similarly.

Figure 5 illustrates the three-dimensional sets $\mathcal{G}_2^+$ and $\mathcal{G}_2^-$, and the two-dimensional set $\mathcal{M}_{2,1}$
of product distributions on $\{0,1\}^2$.

2 Interactive 3-dimensional graphic object available at www.personal.psu.edu/gfmlO/blogs/gfmc_blog/indepm o .pdf
(open in Adobe Acrobat Reader 7 or higher).

Example 3. The set of distributions in $\mathcal{P}_3$ with modes $Z_{+,3}$ is the intersection of $\mathcal{P}_3$
and 12 open half-spaces defined by $p(x) > p(y)$ for all $y$ with $d_H(x,y) = 1$ for all $x \in Z_{+,3}$.
The closure of this set is a convex polytope $\overline{\mathcal{G}_3^+}$ with 19 vertices. The Lebesgue volume can
be computed (e.g., using the software Polymake [11]): $\operatorname{vol}(\mathcal{G}_3^+)/\operatorname{vol}(\mathcal{P}_3) \approx 0.0179$. The set
of distributions in $\mathcal{P}_3$ with four strong modes on $Z_{+,3}$, $\mathcal{H}_3^+$, is an intersection of 4 half-spaces
defined by $p(x) > \sum_{y : d_H(x,y) = 1} p(y)$ for $x \in Z_{+,3}$.
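
The combinatorial counts in Example 3 can be checked by enumeration. The sketch below is illustrative only (the helper names are ours): it builds the even-parity code on three bits, lists the pairs (x, y) that give rise to the mode inequalities, and confirms that the uniform distribution on the code has exactly the four code words as strong modes. It does not attempt the vertex enumeration or the volume computation, for which the text refers to Polymake.

```python
import itertools


def hamming(x, y):
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def even_parity_code(n):
    """Binary strings of length n with an even number of ones."""
    return [x for x in itertools.product((0, 1), repeat=n) if sum(x) % 2 == 0]


def min_distance(code):
    """Minimum Hamming distance between distinct code words."""
    return min(hamming(x, y) for x, y in itertools.combinations(code, 2))


cube = list(itertools.product((0, 1), repeat=3))
z_plus = even_parity_code(3)

# One strict inequality p(x) > p(y) per code word x and neighbor y.
mode_inequalities = [(x, y) for x in z_plus for y in cube if hamming(x, y) == 1]

# The uniform distribution on the code and its strong modes.
p = {x: (1.0 / len(z_plus) if x in z_plus else 0.0) for x in cube}
strong = {
    x for x in cube
    if p[x] > sum(p[y] for y in cube if hamming(x, y) == 1)
}
```

Here `mode_inequalities` has 12 entries (four code words with three neighbors each), and `strong` coincides with the code, in line with the four half-spaces that define the set of distributions with these strong modes.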

3 Modes of mixtures of product distributions 

In this section, we characterize the sets of modes and strong modes that can appear in
mixtures of product distributions, and show how this can be used to obtain a polyhedral
approximation of the set of probability distributions representable in such models.

Problem 4. What is the smallest $k \in \mathbb{N}$ for which $\mathcal{M}_{n,k}$ contains a distribution with $l$
(strong) modes?

One can show that a mixture of k unimodal discrete probability distributions has at most 
k strong modes. The next result is given for the case of mixtures of product distributions 
with finite valued variables, including mixtures of binary product distributions as a special 
case: 

Theorem 5. Let $\mathcal{M}$ be the $k$-mixture of the set of product distributions $p_1(x_1) \cdots p_n(x_n)$
with variables $x_i \in \mathcal{X}_i$, $|\mathcal{X}_i| < \infty$ for $i \in [n]$. The sets of strong modes of probability
distributions in the model $\mathcal{M}$ are exactly the sets of strings in $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ of minimum
Hamming distance at least two and cardinality at most $k$. Furthermore, if $p$ is a mixture of
product distributions and $p$ has strong modes $C$, then every $y \in C$ is the mode of one mixture
component of $p$.

Proof. Any product distribution $q$ has at most one mode, because the value of the product
$q_1(x_1) \cdots q_n(x_n)$ is either maximal, or can be increased by flipping only one entry of $x$. If $q$
is a product distribution and $x$ is not a mode of $q$, then there is a $y$ with $q(y) \ge q(x)$ and
$d_H(y,x) = 1$. If $q^{(j)}$ is a product distribution for $j \in [k]$ and $x$ is not a mode of any $q^{(j)}$, then
$\sum_{y : d_H(y,x) = 1} \sum_{j \in [k]} \alpha_j q^{(j)}(y) \ge \sum_{j \in [k]} \alpha_j q^{(j)}(x)$ for any $\alpha_j \ge 0$, and $x$ is not a strong mode
of $p$. On the other hand, the set of product distributions contains every point measure $\delta_y$
(which is just the product distribution $q_1(x_1) \cdots q_n(x_n)$ with $q_i(y_i) = 1$ for all $i \in [n]$), and
hence $\mathcal{M}$ contains any distribution $\sum_{y \in C} \frac{1}{|C|} \delta_y$ with $|C| \le k$. □
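
Theorem 5 can be probed numerically in the binary case. The sketch below is a sanity check under our own assumptions, not a proof and not code from the paper: it samples random mixtures of k binary product distributions and records the largest number of strong modes observed, and it builds the mixture of point measures used in the last step of the proof. All names (`product_distribution`, `max_strong_modes_seen`, and so on) are hypothetical.

```python
import itertools
import random


def product_distribution(marginals):
    """Product distribution on {0,1}^n; marginals[i] is the probability of x_i = 1."""
    dist = {}
    for x in itertools.product((0, 1), repeat=len(marginals)):
        prob = 1.0
        for xi, qi in zip(x, marginals):
            prob *= qi if xi == 1 else 1.0 - qi
        dist[x] = prob
    return dist


def mixture(weights, components):
    """Convex combination of distributions given as dictionaries."""
    return {
        x: sum(w * q[x] for w, q in zip(weights, components))
        for x in components[0]
    }


def strong_modes(p):
    """Strings whose probability exceeds the total probability of their neighbors."""
    def nbrs(x):
        return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]
    return {x for x in p if p[x] > sum(p[y] for y in nbrs(x))}


def max_strong_modes_seen(n, k, trials, seed=0):
    """Largest number of strong modes among random k-mixtures on {0,1}^n."""
    rng = random.Random(seed)
    best = 0
    for _ in range(trials):
        comps = [
            product_distribution([rng.random() for _ in range(n)])
            for _ in range(k)
        ]
        raw = [rng.random() for _ in range(k)]
        total = sum(raw)
        p = mixture([r / total for r in raw], comps)
        best = max(best, len(strong_modes(p)))
    return best


# Mixture of point measures on a code of minimum distance two (proof of Theorem 5).
code = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
point_mixture = mixture(
    [1.0 / 3] * 3,
    [product_distribution([float(c) for c in y]) for y in code],
)
```

By the theorem, `max_strong_modes_seen(n, k, ...)` can never exceed k, while `point_mixture` attains exactly three strong modes with three components.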

Remark 6. Although a mixture of $k$ product distributions can have at most $k$ strong modes,
it can have more than $k$ modes. For instance, there are mixtures of two product distributions
on $\{0,1\}^4$ which have more than two modes.

Theorem 5 shows that $(\mathcal{P}_n \setminus \mathcal{M}_{n,m}) \supseteq \mathcal{H}_{n,m+1}$ for all $m$. We can triangulate $\mathcal{H}_{n,m+1}$ and
bound the volume of the complement of $\mathcal{M}_{n,m}$. A rough estimate is:

Proposition 7. When $m < 2^{n-1}$ the Lebesgue volume of the complement of $\mathcal{M}_{n,m}$ is at
least $\operatorname{vol}(\mathcal{H}_{n,m+1})/\operatorname{vol}(\mathcal{P}_n) \ge 2^{-(m+1)n} K(m+1)$, where $K(m+1) = 2^{m+1}$ if $m+1 \le 2^k \le$ […]
for some $k$, and $K(m+1) = 2$ otherwise.

Proof See Appendix. □ 
3.1 Polyhedral approximation of the full-dimensional model $\mathcal{M}_{3,3}$

Theorem 5 shows that any $p \in \mathcal{M}_{3,3}$ has at most three strong modes. We can show that
this is true for modes too:

Proposition 8. The mixture model of three product distributions on $\{0,1\}^3$ cannot realize
distributions with four modes: $\mathcal{M}_{3,3} \cap \mathcal{G}_3 = \emptyset$.



Proof. See Appendix. □ 

Remark 9. The sets $\mathcal{G}_3^+$ and $\mathcal{G}_3^-$ are intersections of half-spaces that contain the uniform
distribution. Although $\mathcal{M}_{3,3}$ is full dimensional in $\mathcal{P}_3$, its complement contains points
arbitrarily close to the uniform distribution!

The model $\mathrm{RBM}_{4,2}$ is contained in $\mathcal{M}_{4,4}$ and has co-dimension one in $\mathcal{P}_4$. Its algebraic
implicitization was studied in [7], i.e., its description as the set of zeros of a collection of
polynomials. It was found to be the zero locus of a polynomial of degree 110 in as many
as 5.5 trillion monomials. By Theorem 5, $\mathcal{M}_{4,4} \cap \mathcal{H}_4 = \emptyset$ and $\mathrm{RBM}_{4,2} \cap \mathcal{H}_4 = \emptyset$. Using
Proposition 8 and Lemma 33 (in the Appendix) it is possible to show:

Corollary 10. The models $\mathcal{M}_{4,4}$ and $\mathrm{RBM}_{4,2}$ do not contain distributions with 8 modes:
$\mathcal{M}_{4,4} \cap \mathcal{G}_4 = \emptyset$ and $\mathrm{RBM}_{4,2} \cap \mathcal{G}_4 = \emptyset$.



4 Modes of RBMs 



In this section, we characterize the sets of modes and strong modes that can appear in RBMs.
This is a more complex problem than it was for mixtures of product distributions (Section 3),
and will necessitate developing characterizations in terms of the interaction of certain
point configurations called zonosets (Definition 13) and hyperplane arrangements, and in
terms of linear threshold functions. Again this helps provide a polyhedral approximation of
the set of representable distributions.

Problem 11. What is the smallest $m \in \mathbb{N}$ for which $\mathrm{RBM}_{n,m}$ contains a distribution with
$l$ (strong) modes?

Remark 12. The model $\mathrm{RBM}_{n,m}$ contains any distribution with support of cardinality
$\min\{m+1, 2^n\}$ [19, Theorem 1], and therefore it contains some distributions with
$\min\{m+1, 2^{n-1}\}$ strong modes; for example $\mathrm{RBM}_{n,m}$ contains a uniform distribution on
a code of cardinality $\min\{m+1, 2^{n-1}\}$ and minimum Hamming distance 2. In particular,
if $\mathcal{M}_{n,k+1}$ contains a distribution with strong modes $C$, then $\mathrm{RBM}_{n,k}$ also contains a
distribution with strong modes $C$.

By Theorem 5 and $\mathrm{RBM}_{n,m} \subseteq \mathcal{M}_{n,2^m}$, any $p \in \mathrm{RBM}_{n,m}$ has at most $\min\{2^m, 2^{n-1}\}$ strong
modes. We will see that this bound is often tight.



4.1 Zonosets and hyperplane arrangements 

Definition 13. Let $m \ge 0$ and $n > 0$ be integers. Let $w_i \in \mathbb{R}^n$ for all $i \in [m]$ and $b \in \mathbb{R}^n$. The
multiset $Z = \{b + \sum_{i \in I} w_i\}_{I \subseteq [m]}$ is called an $m$-generated zonoset in $\mathbb{R}^n$.

The convex hull of a zonoset is called a zonotope. Zonotopes are well known in the literature of
polytopes. They can be identified with hyperplane arrangements and oriented matroids [3],
[34]. Given a sign vector $s \in \{-,+\}^n$, or a binary vector $s \in \{0,1\}^n$, the $s$-orthant of $\mathbb{R}^n$
consists of all vectors $x \in (\mathbb{R} \setminus \{0\})^n$ with $\operatorname{sgn}(x) = s$, or $H(x) = s$, where $H$ is the Heaviside
step function. We say that an orthant has even (odd) parity if its sign vector has an even
(odd) number of $+$. The following characterizes strong modes of RBMs:

Theorem 14. Let $C \subseteq \{0,1\}^n$ be a binary code of minimum Hamming distance two.

• If $\mathrm{RBM}_{n,m}$ contains a distribution with strong modes $C$, i.e., $\mathrm{RBM}_{n,m} \cap \mathcal{H}_C \ne \emptyset$,
then there is an $m$-generated zonoset with a point in each $C$-orthant of $\mathbb{R}^n$.
• If there is an $m$-generated zonoset intersecting exactly the $C$-orthants of $\mathbb{R}^n$ at points of equal
$\ell_1$-norm, then $\mathrm{RBM}_{n,m} \cap \mathcal{H}_C \ne \emptyset$.

Proof. If there is a $p$ in $\mathrm{RBM}_{n,m} \cap \mathcal{H}_C$, then for each $x \in C$ there is an $h \in \{0,1\}^m$ for
which $p(\cdot|h)$ is uniquely maximized by $x$ (Theorem 5 and $\mathrm{RBM}_{n,m} \subseteq \mathcal{M}_{n,2^m}$). Hence
$(hW + B)x > (hW + B)v$ for all $v \ne x$, and equivalently, $\operatorname{sgn}(hW + B) = \operatorname{sgn}(x - \tfrac{1}{2})$. The
existence of such $W$ and $B$ is equivalent to the existence of a zonoset with a point in each
$C$-orthant of $\mathbb{R}^n$, as in the first claim. Assume now that $W, B$ can be chosen such that all vectors
$hW + B$ have the same $\ell_1$-norm, $K$. We have

$$\tfrac{1}{2} K = \tfrac{1}{2} \|hW + B\|_1 = (hW + B)(x_h - \tfrac{1}{2}) = (hW + B) x_h + hC + \tfrac{1}{2} K',$$

where $C = -\tfrac{1}{2} W (1, \ldots, 1)^T$ and $K' = -B (1, \ldots, 1)^T$, for some
$x_h \in C$ for all $h \in \{0,1\}^m$. The RBM with parameters $\alpha W$, $\alpha B$, $\alpha C = -\alpha W \tfrac{1}{2} (1, \ldots, 1)^T$,
and $\alpha \to \infty$ produces $\sum_{h \in \{0,1\}^m} \delta_{x_h} / 2^m$ as its visible distribution. □

Remark 15. The first part of Theorem 14 remains true if we extend $\mathcal{H}_C$ to the set of
distributions for which any $\mathcal{M}_{n,2^m}$-decomposition has a mixture component with mode $c$,
for every $c \in C$.

The model $\mathrm{RBM}_{n,m}$ is symmetric under relabeling of any of its variables; in particular, there
is an RBM distribution with strong modes $C$ iff there is one with strong modes
$C \oplus_2 x = \{c + x \bmod 2 : c \in C\}$ for any $x \in \{0,1\}^n$.

Definition 16. A hyperplane arrangement $\mathcal{A}$ in $\mathbb{R}^n$ is a finite set of (affine) hyperplanes
$\{H_i\}_{i \in [k]}$ in $\mathbb{R}^n$. Choosing an orientation for each hyperplane, each vector $x \in \mathbb{R}^n$ receives a
sign vector $\operatorname{sgn}_{\mathcal{A}}(x) \in \{-, 0, +\}^k$, where $(\operatorname{sgn}_{\mathcal{A}}(x))_i$ indicates whether $x$ lies on the negative
side of $H_i$, inside, or on its positive side. The set of all vectors in $\mathbb{R}^n$ with the same sign
vector is called a cell of $\mathcal{A}$.

A necessary condition for the existence of an $m$-generated zonoset intersecting all $C$-orthants
of $\mathbb{R}^n$ is that the number of orthants of $\mathbb{R}^n$ that are intersected by an $m$-dimensional affine
space is at least $|C|$. The maximal number of orthants intersected by a $d$-dimensional linear
subspace of $\mathbb{R}^n$ was derived in [29] and can be found in [5]. It is not difficult to derive the
corresponding number for a $d$-dimensional affine subspace too:

$$\mathfrak{C}(n,d) := 2 \sum_{i=0}^{d-1} \binom{n-1}{i} \quad \text{and} \quad \mathfrak{C}_{\mathrm{aff}}(n,d) := \sum_{i=0}^{d} \binom{n}{i}. \qquad (4)$$

Cover [5] shows that $\mathfrak{C}(n,d)$ is also the number of partitions of an $n$-point set in general
position in $\mathbb{R}^d$ by central hyperplanes (hyperplanes through the origin). A set of vectors
in $\mathbb{R}^d$ is in general position if any $d$ or fewer are linearly independent. Dually, $\mathfrak{C}_{\mathrm{aff}}(n,d)$ can
be seen as the number of cells of a real $d$-dimensional arrangement of $n$ hyperplanes in
general position [24], [28]. In particular, there are affine hyperplanes of $\mathbb{R}^n$ intersecting $2^n - 1$
orthants. Figure 4 (right) is an example showing the intersection of a 2-dimensional affine
subspace of $\mathbb{R}^3$ and 7 orthants of $\mathbb{R}^3$; four of odd parity and three of even parity. The
number of even, or odd, orthants of $\mathbb{R}^n$ intersected by a generic $m$-dimensional affine space
is $\lceil \tfrac{1}{2} \mathfrak{C}_{\mathrm{aff}}(n,m) \rceil$ or $\lfloor \tfrac{1}{2} \mathfrak{C}_{\mathrm{aff}}(n,m) \rfloor$. If $m < n$, then $\lfloor \tfrac{1}{2} \mathfrak{C}_{\mathrm{aff}}(n,m) \rfloor \ge 2^m$. This does not imply
that every collection of $2^m$ even orthants can be intersected by an $m$-generated zonoset:

Proposition 17. If $n$ is an odd natural number larger than one, then there is no
$(n-1)$-generated zonoset with a point in every even, or every odd, orthant of $\mathbb{R}^n$.

Proposition 17 allows us to describe some distributions that cannot be represented by RBMs.
A code $C \subseteq \{0,1\}^n$ extends another code $C' \subseteq \{0,1\}^r$, $r \le n$, if restricting $C$ to some $r$
indices yields $C'$.

Corollary 18. If $m$ is an even nonzero natural number and $m < n$, then $\mathrm{RBM}_{n,m} \cap \mathcal{H}_C = \emptyset$
for any code $C \subseteq \{0,1\}^n$ extending $Z_{+,m+1}$ or $Z_{-,m+1}$. In particular, when $n$ is an odd
natural number larger than one, $\mathrm{RBM}_{n,n-1}$ cannot represent distributions with $2^{n-1}$ strong
modes.

Proof. If there is an $m$-generated zonoset with points in every $C$-orthant of $\mathbb{R}^n$ and there is
a restriction of $C$ to $Z_{+,m+1}$ or $Z_{-,m+1}$, then there is an $m$-generated zonoset contradicting
Proposition 17. By Theorem 14, $\mathrm{RBM}_{n,m}$ cannot represent distributions with strong modes
$C$. □

In Appendix B (Corollary 42) we extend the statement by the special case $\mathrm{RBM}_{6,5}$, which
cannot represent distributions with $2^{6-1}$ strong modes.

Remark 19. As a consequence of Corollary 18, the hierarchical model on the full bipartite
graph $K_{n,m}$ does not contain in its closure any distribution supported on any set
$\mathcal{Y} \subseteq \{0,1\}^{n+m}$ with

$$\{(y_{i_1}, \ldots, y_{i_{m+1}}) : y \in \mathcal{Y}\} = Z_{\pm,m+1}$$

for some $1 \le i_1 < \cdots < i_{m+1} \le n$.

4.2 Polyhedral approximation of the full-dimensional model $\mathrm{RBM}_{3,2}$

The model $\mathrm{RBM}_{3,2}$ is particularly interesting, because it is the smallest candidate for an
RBM universal approximator on $\{0,1\}^3$ in terms of the number of mixture components of
the mixtures of products that it represents, but it has fewer than $2^{n-1} - 1$ hidden units, the
upper bound for the number of hidden units of the smallest RBM universal approximator
given in [19]. Note that the model $\mathrm{RBM}_{3,1} = \mathcal{M}_{3,2}$ is readily full dimensional.



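By Corollary 18 the model $\mathrm{RBM}_{3,2}$ is not a universal approximator. We illustrate this
explicitly: By Theorem 14, if $\mathrm{RBM}_{3,2} \cap \mathcal{H}_3 \ne \emptyset$, then

$$\operatorname{sgn} \begin{pmatrix} B \\ W_1 + B \\ W_2 + B \\ W_1 + W_2 + B \end{pmatrix} = \begin{pmatrix} - & - & - \\ - & + & + \\ + & - & + \\ + & + & - \end{pmatrix} \qquad (5)$$

up to permutations of rows, but it is quickly checked that this equation cannot be satisfied.

The set $\mathcal{H}_3$ is the disjoint union of $\mathcal{H}_3^+ = \mathcal{H}_{Z_{+,3}}$ and $\mathcal{H}_3^- = \mathcal{H}_{Z_{-,3}}$. Its volume is
$\operatorname{vol}(\mathcal{H}_3)/\operatorname{vol}(\mathcal{P}_3) \approx 0.0078$.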

The set $\overline{\mathcal{H}_3^+}$ is the 7-dimensional simplex defined by the intersection of the 8 half-spaces
with inequalities $p(z) \ge \sum_{y : d_H(z,y) = 1} p(y)$ for $z \in Z_{+,3}$ and $p(y) \ge 0$ for all $y \in Z_{-,3}$. Its
vertices are the uniform distributions on the following sets:

{000, 001, 011, 101}, {011, 101, 110, 111}, {000, 010, 011, 110}, {000, 100, 101, 110},
{000}, {011}, {101}, {110}.

The first four vertices are mixtures of two point measures and one uniform distribution on
a pair of Hamming distance one. The last four are the point measures on $Z_{+,3}$. By [20,
Theorem 1], all these distributions are contained in $\mathrm{RBM}_{3,2}$ (by symmetry, the vertices of
$\overline{\mathcal{H}_3^-}$ are also in $\mathrm{RBM}_{3,2}$). The distributions in the relative interiors of the edges of $\overline{\mathcal{H}_3^+}$
between the first four vertices are not in $\mathrm{RBM}_{3,2}$. The relative interiors of edges connecting
one of the first four and one of the last four vertices are in $\mathrm{RBM}_{3,2}$ if they have support of
cardinality four and are not if they have support of cardinality five. We believe that in fact
$\mathrm{RBM}_{3,2} \cap \mathcal{G}_3 = \emptyset$, but this still needs formal verification.

4.3 Linear threshold functions and codes 

Definition 20. A linear threshold function (LTF) with $m$ (binary) inputs is a function

$$f : \{0,1\}^m \to \{-,+\} ; \quad f(x) = \operatorname{sgn}\Big(\Big(\sum_{i \in [m]} v_i x_i\Big) - b\Big),$$

where $v \in \mathbb{R}^m$ is called weight vector and $b \in \mathbb{R}$ bias. A subset $C \subseteq \{0,1\}^m \subset \mathbb{R}^m$ is linearly
separable iff there exists an LTF with $f(C) = +$ and $f(\{0,1\}^m \setminus C) = -$. For convenience
we shall identify sign vectors and 0/1 vectors via $- \leftrightarrow 0$ and $+ \leftrightarrow 1$. The opposite $\bar{x}$ of a
binary vector $x$ is the vector given by inverting all entries of $x$.
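
For very small m the LTFs of Definition 20 can be enumerated by brute force. The sketch below makes an assumption that should be stated loudly: it searches only integer weights of absolute value at most 2 and half-integer biases, which is known to suffice for m ≤ 3 but not for larger m. The function names are ours and purely illustrative. Each LTF is recorded as its truth table over the vertices of the m-cube in lexicographic order, with 1 for + and 0 for −.

```python
import itertools


def ltf_table(v, b, m):
    """Truth table of x -> sgn(v.x - b), with 1 for '+' and 0 for '-'."""
    return tuple(
        1 if sum(vi * xi for vi, xi in zip(v, x)) - b > 0 else 0
        for x in itertools.product((0, 1), repeat=m)
    )


def enumerate_ltfs(m, max_weight=2):
    """Distinct LTFs found with small integer weights and half-integer biases.

    Assumption: weights in [-max_weight, max_weight] are enough; this holds
    for m <= 3 with max_weight = 2, but fails for larger m.
    """
    weights = range(-max_weight, max_weight + 1)
    max_bias = m * max_weight
    # Half-integer biases keep v.x - b away from zero for integer weights.
    biases = [k + 0.5 for k in range(-max_bias - 1, max_bias + 1)]
    return {
        ltf_table(v, b, m)
        for v in itertools.product(weights, repeat=m)
        for b in biases
    }
```

With these search ranges the enumeration finds 14 LTFs with two inputs (all Boolean functions except XOR and its negation) and 104 LTFs with three inputs.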
LTFs are also known as McCulloch-Pitts neurons and have been studied intensively in the
context of feedforward artificial networks. The problem of separating subsets of vertices of
the $m$-dimensional hypercube by hyperplane arrangements (multi-label classification) has
drawn much attention, see, e.g., [32]. The logarithm of the number of LTFs with $m$ inputs is
asymptotically of order $m^2$ [36], [23]. The exact number is only known for $m \le 9$ [33], [21], [22].
The study of LTFs simplifies when $f(x_1, \ldots, x_m) = -f(\bar{x}_1, \ldots, \bar{x}_m)$ for all $x \in \{0,1\}^m$, in
which case they are called self-dual. If an LTF has an equal number of positive and negative
points, then it separates every input from its opposite, and is self-dual (see Observation 35
in Appendix A).

Definition 21. A subset $C \subseteq \{0,1\}^n$ is an $(n,m)$-linear threshold code (LTC) iff there exist
$n$ linear threshold functions $f_i : \{0,1\}^m \to \{0,1\}$, $i \in [n]$, with

$$\{(f_1(x), f_2(x), \ldots, f_n(x)) \in \{0,1\}^n : x \in \{0,1\}^m\} = C.$$

If the functions $f_i$ can be chosen self-dual, then $C$ is called homogeneous.

Remark 22. Let $n, m \in \mathbb{N}$ and $C \subseteq \{0,1\}^n$. Then the following statements are equivalent:

• the points of an $m$-generated zonoset $Z$ in $\mathbb{R}^n$ are in the $C$-orthants of $\mathbb{R}^n$,

• the vertices of the $m$-dimensional cube are in the $C$-cells of an arrangement of $n$
hyperplanes in $\mathbb{R}^m$, and

• the code $C$ is an $(n,m)$-LTC.

Theorem 23. Let $C \subseteq \{0,1\}^n$, $|C| = 2^m$, be a code of minimum distance at least two. If
$\mathrm{RBM}_{n,m}$ contains a distribution with strong modes $C$, then $C$ is an $(n,m)$-LTC. If $C$ is a
homogeneous $(n,m)$-LTC, then $\mathrm{RBM}_{n,m}$ contains distributions with strong modes $C$.

Proof. If $C$ is homogeneous, then the separating hyperplanes may be chosen through the
center of the cube (see Observation 35) and by Theorem 14, $\mathrm{RBM}_{n,m}$ contains
$\sum_{c \in C} \delta_c / |C| \in \mathcal{H}_C$. If $\mathrm{RBM}_{n,m}$ can represent a distribution with strong modes $C$, then there is an
$m$-generated zonoset with points in the $C$-orthants of $\mathbb{R}^n$ (Theorem 14) and $C$ is defined by $n$
linear separations of vertices of the $m$-cube (Remark 22). □



In the following examples an LTF with m inputs is written as a list of the vertices of the 
m-cube (the decimal they represent plus one), with a bar on inputs that yield a negative 
output. 

Example 24. Let $n = 3$ and $m = 2$. There are only two ways to linearly separate the
vertices of the 2-cube into sets of cardinality two (up to opposites): $\bar{1}\bar{2}34$ and $\bar{1}2\bar{3}4$.
These are the only possible columns of an LTC with two inputs (up to opposites). The code
$Z_{\pm,3}$ is not a $(3,2)$-LTC, because whenever $Z_{\pm,3}$ is written as a $4 \times 3$-matrix, it has three
non-equivalent columns. This shows that there does not exist a 2-generated zonoset with
vertices in the four even, or odd, orthants of $\mathbb{R}^3$. This also proves that $\mathrm{RBM}_{3,2}$ does not
contain any distributions with four strong modes.

An alternative way of proving this is: The Hamming distance between any two elements of
$Z_{\pm,n}$ is even. If the distance of any two vertices of the 2-cube induced by an arrangement
of three hyperplanes is even and nonzero, then each edge of the 2-cube is sliced at least
twice, and in total at least eight edges are sliced (repetitions allowed). On the other hand,
each hyperplane slices at most two edges of the 2-cube, and so three hyperplanes slice at most 6 edges
(repetitions allowed).
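
The claim of Example 24 can also be confirmed by exhaustive search. The following sketch is our own illustration, not the paper's method: it relies on the assumption that small integer weights and half-integer biases reach every LTF with two inputs (there are 14 of them), and then tries all triples of such LTFs. The helper names `ltf_tables` and `is_ltc` are hypothetical.

```python
import itertools


def ltf_tables(m, max_weight=2):
    """Truth tables (1 for '+', 0 for '-') of LTFs found with small integer weights."""
    points = list(itertools.product((0, 1), repeat=m))
    max_bias = m * max_weight
    biases = [k + 0.5 for k in range(-max_bias - 1, max_bias + 1)]
    return {
        tuple(1 if sum(a * x for a, x in zip(v, pt)) - b > 0 else 0 for pt in points)
        for v in itertools.product(range(-max_weight, max_weight + 1), repeat=m)
        for b in biases
    }


def is_ltc(code, m):
    """Brute-force test of whether `code` is an (n, m)-linear threshold code."""
    target = set(code)
    n = len(next(iter(target)))
    tables = sorted(ltf_tables(m))
    for fs in itertools.product(tables, repeat=n):
        # Image of the m-cube under the map x -> (f_1(x), ..., f_n(x)).
        image = {tuple(f[j] for f in fs) for j in range(2 ** m)}
        if image == target:
            return True
    return False


cube3 = list(itertools.product((0, 1), repeat=3))
z_plus_3 = [x for x in cube3 if sum(x) % 2 == 0]
z_minus_3 = [x for x in cube3 if sum(x) % 2 == 1]
```

Under these assumptions `is_ltc(z_plus_3, 2)` and `is_ltc(z_minus_3, 2)` both return `False`, whereas a code such as {000, 010, 100, 111}, the image of the 2-cube under (x1, x2, x1 AND x2), is recognized as a (3,2)-LTC. The search is exponential in n and is only meant for toy sizes.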

Example 25. Let $n = 4$ and $m = 3$. There are 104 ways to linearly separate the vertices
of the 3-cube [23]. A complete list appears in [3, Section 3.8]. The vertices of the 3-cube
are in the $Z_{+,4}$-cells of an arrangement of four hyperplanes corresponding to the
$(4,3)$-LTC with following LTFs: 12345678, 12345678, 12345678, 12345678. By Remark 22 this
arrangement corresponds to a 3-generated zonoset with points in the 8 even orthants of $\mathbb{R}^4$.
This zonoset can be realized as follows:

Figure 6: The four slicings of the 3-cube discussed in Example 25.

This choice of $w$ and $b$ corresponds to a central arrangement of four hyperplanes with two
hyperplanes slicing each edge of the 3-cube, as shown in Figure 6.
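
One possible realization of such a zonoset can be written down and verified directly. The generators `W` and the offset `b` below are our own construction, chosen for illustration; they are an assumption and need not coincide with the choice of $w$ and $b$ made in the example. The sketch lists the 8 points of the zonoset, reads off their orthants, and checks the properties discussed in the text.

```python
import itertools

# Hypothetical generators: rows of V are the generators in +/-1 cube coordinates.
V = [[-1, 1, -1, -1],
     [-1, -1, 1, -1],
     [-1, -1, -1, 1]]
W = [[2 * v for v in row] for row in V]                     # w_1, w_2, w_3 in R^4
b = [-sum(V[j][i] for j in range(3)) for i in range(4)]     # b = (3, 1, 1, 1)


def zonoset(W, b):
    """Points b + sum_{i in I} w_i over all subsets I, indexed by h in {0,1}^m."""
    m, n = len(W), len(b)
    return {
        h: tuple(b[i] + sum(h[j] * W[j][i] for j in range(m)) for i in range(n))
        for h in itertools.product((0, 1), repeat=m)
    }


points = zonoset(W, b)
# Orthant of each point as a 0/1 vector (1 for a positive coordinate).
orthant = {h: tuple(int(c > 0) for c in pt) for h, pt in points.items()}


def slices_per_edge(orthant):
    """Number of hyperplanes separating the endpoints of each edge of the cube."""
    counts = []
    for h in orthant:
        for j in range(len(h)):
            if h[j] == 0:
                g = h[:j] + (1,) + h[j + 1:]
                counts.append(sum(a != c for a, c in zip(orthant[h], orthant[g])))
    return counts
```

With this choice the 8 points lie in 8 distinct orthants of even parity, all points have the same l1-norm (equal to 6), so the second part of Theorem 14 applies, and every edge of the 3-cube is sliced by exactly two of the four hyperplanes.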

Example 26. Let $n = 5$ and $m = 4$. There are three symmetry types of self-dual LTFs
with four input bits [21]. The following are representatives of the three types:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16.

By Proposition 17 the code $Z_{\pm,5}$ cannot be realized by any 5 LTFs.

The following example presents a kind of binary code $C$ of cardinality $2^m$ for which
$\mathrm{RBM}_{n,m} \cap \mathcal{H}_C = \emptyset$, which is not covered by Corollary 18.

Example 27. Let $n = 5$ and $m = 3$. Let $x', x'' \in Z_{\pm,4}$ with $d_H(x', x'') = 4$, and
$C = \{(x_1, \ldots, x_5) : (x_1, \ldots, x_4) \in Z_{\pm,4}, (x_{m+2}, \ldots, x_n) = (1, 0, \ldots, 0)$ if $(x_1, \ldots, x_4) \in
\{x', x''\}$ and $(0, \ldots, 0)$ otherwise$\}$. If $C$ is an LTC, some hyperplane separates two vertices of
the 3-cube from the other vertices (corresponding to $x_5 = 1$ only for two points). These two
vertices must be connected by an edge of the 3-cube. Since $d_H(x', x'') = 4$, four hyperplanes
pass through this edge. There are only three different central hyperplanes through an edge
of the 3-cube, but four different central hyperplanes are required to produce $Z_{\pm,4}$.



5 When does a mixture of products contain an RBM? 

Using the characterizations obtained in Sections 3 and 4, we now prove the main result
discussed in the Introduction. The technical statement of this result is Theorem 29, while
its interpretation appears in Section 1 and Figure 1. To do this, we derive upper bounds for
the smallest $m$ such that $\mathrm{RBM}_{n,m}$ contains probability distributions with $l$ strong modes,
and show thereby that RBMs can represent many more modes than MoPs with the same
number of parameters.

Any representability result such as Example 25, combined with the following observation,
yields lower bounds on the smallest mixture of products which contains the RBM model.
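Remark 28. Assume that $W^{(i)} \in \mathbb{R}^{m_i \times n_i}$ and $B^{(i)} \in \mathbb{R}^{n_i}$ generate a zonoset
$\{hW^{(i)} + B^{(i)} : h \in \{0,1\}^{m_i}\}$ intersecting $K^{(i)}$ even orthants of $\mathbb{R}^{n_i}$, for all $i \in [k]$. Then

$$W = \begin{pmatrix} W^{(1)} & & \\ & \ddots & \\ & & W^{(k)} \end{pmatrix} \quad\text{and}\quad B = (B^{(1)}, \ldots, B^{(k)})$$

generate a zonoset $\{hW + B : h \in \{0,1\}^{m_1+\cdots+m_k}\}$ intersecting $\prod_{i \in [k]} K^{(i)}$ even orthants of
$\mathbb{R}^{n_1+\cdots+n_k}$.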
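The block construction of Remark 28 can be illustrated on a toy example. The sketch below assumes that an orthant is called even when its sign vector has an even number of positive entries, and it uses a hypothetical building block ($W = (2\ 2)$, $B = (-1, -1)$) chosen only for illustration; it is not the matrix $w$ of Example 25.

```python
from itertools import product

def zonoset(W, B):
    """All points hW + B for h in {0,1}^m, where W is a list of m rows of length n."""
    m, n = len(W), len(B)
    return [tuple(B[j] + sum(h[i] * W[i][j] for i in range(m)) for j in range(n))
            for h in product((0, 1), repeat=m)]

def even_orthants_hit(points):
    """Sign vectors (1 = positive, 0 = negative) with an even number of positive entries."""
    signs = {tuple(int(c > 0) for c in p) for p in points if all(c != 0 for c in p)}
    return {s for s in signs if sum(s) % 2 == 0}

def block_diag(blocks):
    """Stack matrices block-diagonally, padding with zeros."""
    total_cols = sum(len(b[0]) for b in blocks)
    out, offset = [], 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            out.append([0] * offset + list(row) + [0] * (total_cols - offset - width))
        offset += width
    return out

# Hypothetical 1-generated block in R^2 hitting both even orthants (--) and (++).
W1, B1 = [[2, 2]], [-1, -1]
k1 = len(even_orthants_hit(zonoset(W1, B1)))

# Three copies stacked block-diagonally: a 3-generated zonoset in R^6.
W = block_diag([W1, W1, W1])
B = B1 * 3
k_total = len(even_orthants_hit(zonoset(W, B)))
```

Here the number of even orthants hit by the stacked zonoset equals the product of the numbers hit by the blocks, as the remark states.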

The following theorem provides the justification for the statement in the introduction that
"the number of parameters of the smallest MoP model containing an RBM model grows
exponentially in the number of parameters of the RBM for any fixed ratio $0 < m/n < \infty$,"
and for Figure 1.

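Theorem 29. Let $n, m \in \mathbb{N}$.

• If $4\lceil m/3 \rceil \le n$, then $\mathrm{RBM}_{n,m} \cap \mathcal{H}_{n,2^m} \ne \emptyset$, and $\mathcal{M}_{n,k} \supseteq \mathrm{RBM}_{n,m}$ if and only if $k \ge 2^m$.

• If $4\lceil m/3 \rceil > n$, then $\mathrm{RBM}_{n,m} \cap \mathcal{H}_{n,L} \ne \emptyset$, where $L := \min\{2^l + m - l,\ 2^{n-1}\}$,
$l := \max\{l \in \mathbb{N} : 4\lceil l/3 \rceil \le n\}$, and $\mathcal{M}_{n,k} \supseteq \mathrm{RBM}_{n,m}$ only if $k \ge L$.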
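To make the two cases of Theorem 29 concrete, the following sketch evaluates the resulting lower bound on the number of mixture components. It assumes the reading $L = \min\{2^l + m - l, 2^{n-1}\}$, which matches the proof's remark that each additional hidden unit adds one strong mode; allowing $l = 0$ for very small $n$ is a convenience of this sketch, and the function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for nonnegative a and positive b."""
    return (a + b - 1) // b

def largest_l(n):
    """Largest l >= 0 with 4 * ceil(l / 3) <= n."""
    l = 0
    while 4 * ceil_div(l + 1, 3) <= n:
        l += 1
    return l

def mixture_lower_bound(n, m):
    """Lower bound on k such that M_{n,k} can contain RBM_{n,m} (assumed reading of Theorem 29)."""
    if 4 * ceil_div(m, 3) <= n:
        return 2 ** m                      # first item: k >= 2^m
    l = largest_l(n)
    return min(2 ** l + m - l, 2 ** (n - 1))  # second item: k >= L

# A few values: (n, m) -> bound
examples = {(n, m): mixture_lower_bound(n, m) for n, m in [(4, 3), (8, 6), (6, 5), (4, 5)]}
```

For instance, with $n = 8$ and $m = 6$ the bound is $2^6 = 64$ components, while the RBM itself has only $nm + n + m = 62$ parameters.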

Remark 30. The statement of the first item remains true if $m \equiv 1 \bmod 3$ and $4\lfloor m/3 \rfloor + 2 \le
n$. For $n < 3$ we have $\mathcal{M}_{n,k} = \mathrm{RBM}_{n,k-1}$ for any $k \in \mathbb{N}$. For $n = 3$ we believe that $\mathcal{M}_{3,3}$
and $\mathrm{RBM}_{3,2}$ are very similar, if not equal.

Proof. Let $4\lceil m/3 \rceil \le n$. The if direction follows from $\mathrm{RBM}_{n,m} \subseteq \mathcal{M}_{n,2^m}$ for all $n$ and
$m$. For the only if direction we show that $\mathrm{RBM}_{n,m}$ contains a distribution with $2^m$ strong
modes. By Theorem 5 such a distribution is in $\mathcal{M}_{n,k}$ only if $k \ge 2^m$. Consider the following
parameters:



W 



I w 



V 



w 



\ 



J 



B = a(b,b,...,b,-l,...,-l); 
b = §(3,1,1,1); 

C = -^(1 1) t =q(1,...,1) t ; 



where $\alpha \in \mathbb{R}$ is a constant, $w$ is the $3 \times 4$-matrix defined in eq. (6), $\tilde{w}$ consists of the
first or the first two rows of $w$, and $B$ is $\alpha$ times $\lceil m/3 \rceil$ copies of $b$ followed by $-1$s. Let
$I_i$ be the set $\{1, 2, 3, 4\} + 4(i-1) \subseteq [n]$. For $\alpha \to \infty$ the visible distribution generated
by the RBM is the uniform distribution on the following subset of $Z_{+,n}$ of cardinality $2^m$:
$\{v \in \{0,1\}^n : \sum_{j \in I_i} v_j \text{ is even for all } i, \text{ and } v_j = 0 \text{ for all } j > 4\lceil m/3 \rceil\}$. Now let $4\lceil m/3 \rceil >
n$. By the first part, $\mathrm{RBM}_{n,l}$ contains some $p$ with $2^l$ strong modes in $Z_{+,n}$. Moreover,
$\mathrm{RBM}_{n,l+1}$ contains $\lambda p + (1-\lambda)\delta_x$ for any $p \in \mathrm{RBM}_{n,l}$, $x \in \{0,1\}^n$ and $\lambda \in [0,1]$ (see [17]).
Each additional hidden unit can be used to increase the number of strong modes by one,
until the set of strong modes is $Z_{+,n}$. □



6 When does an RBM contain a mixture of products? 

A complementary question to the title question of this paper is: What is the smallest $m$ for
which $\mathrm{RBM}_{n,m}$ contains $\mathcal{M}_{n,k}$? An interesting instance of this problem is:

Problem 31. Does $\mathrm{RBM}_{n,m}$ contain the mixture of products $\mathcal{M}_{n,m+1}$?

Both $\mathrm{RBM}_{n,m}$ and $\mathcal{M}_{n,m+1}$ have the same number of parameters, and expected dimension
$\min\{nm + n + m,\ 2^n - 1\}$. The expected dimension is also the true dimension of the mixture
model, unless $n = 4$ and $m + 1 = 3$ [4], and in most cases it is known to be the true dimension
of the RBM as well [6]. In the following we give a negative answer to Problem 31.
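The parameter count behind Problem 31 is easy to verify. The sketch below assumes the usual parametrizations: an RBM with $n$ visible and $m$ hidden binary units has $nm$ weights plus $n + m$ biases, and a mixture of $k$ binary product distributions has $kn$ Bernoulli parameters plus $k - 1$ free mixture weights. The function names are illustrative.

```python
def rbm_params(n, m):
    """Weights plus visible and hidden biases of a binary RBM."""
    return n * m + n + m

def mop_params(n, k):
    """k*n Bernoulli parameters plus k-1 free mixture weights."""
    return k * n + (k - 1)

def expected_dim(n, m):
    """Expected dimension: parameter count capped by the dimension of the simplex."""
    return min(rbm_params(n, m), 2 ** n - 1)

# RBM_{n,m} and M_{n,m+1} always have the same number of parameters.
same_count = all(rbm_params(n, m) == mop_params(n, m + 1)
                 for n in range(1, 12) for m in range(1, 12))
```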



In the previous section we showed that the nonnegative rank of probability distributions in
the model $\mathrm{RBM}_{n,m}$ is as large as $2^m$. That is, there are tables of probabilities (probability
distributions) represented by the RBM model which cannot be represented as a nonnegative
sum of fewer than $2^m$ nonnegative rank-one tables (product distributions). The rank of a
table $p$ is the smallest number $k$ such that $p$ can be written as a sum of $k$ rank-one tables.
Here, a multivariate probability distribution $p = p(x_1, \ldots, x_n)$ with $x_i \in \mathcal{X}_i$, $|\mathcal{X}_i| = r_i$ for
$i = 1, \ldots, n$ is expressed as an $n$-way $r_1 \times \cdots \times r_n$-table with value $p(x_1, \ldots, x_n)$ at the
entry $(x_1, \ldots, x_n)$. A rank-one table is an outer product of $n$ vectors of lengths $r_1, \ldots, r_n$.
A product distribution in $\mathcal{M}_{n,1}$ is the outer product of the marginal distributions on the
variables $x_1$ through $x_n$ and is a nonnegative rank-one table. By definition, the elements of
$\mathcal{M}_{n,k}$ have nonnegative rank at most $k$, and therefore also rank at most $k$. Since $\mathrm{RBM}_{n,m}$ is
contained in $\mathcal{M}_{n,2^m}$, any $p \in \mathrm{RBM}_{n,m}$ has rank at most $2^m$. We will show that this upper
bound is tight when $2m \le n$.

Two models $A$ and $B$ are called generically distinguishable if $A \cap B$ has relative measure zero
in $A$ and in $B$. The restriction "generically" is useful, because in most cases of interest the
models do intersect (e.g., mixtures of products and RBMs contain the uniform distribution).
A flattening of a table of probabilities is a way of arranging its entries into a two-way table (i.e.,
a matrix) by grouping the variables into two groups and considering the joint states of the
variables in each of the groups as the states of two variables. The following is an example
of a flattening of a table $p$ on four binary variables:

$$\begin{pmatrix}
p_{00,00} & p_{00,01} & p_{00,10} & p_{00,11} \\
p_{01,00} & p_{01,01} & p_{01,10} & p_{01,11} \\
p_{10,00} & p_{10,01} & p_{10,10} & p_{10,11} \\
p_{11,00} & p_{11,01} & p_{11,10} & p_{11,11}
\end{pmatrix}$$

The matrix rank of any flattening of a table $p$ is upper bounded by the outer-product rank
of $p$. In particular, the vanishing of the $(k+1) \times (k+1)$-minors of flattenings are algebraic
invariants of the model $\mathcal{M}_{n,k}$.
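The rank bound for flattenings can be observed numerically. The following sketch builds a product distribution and a two-component mixture on four binary variables, flattens them by grouping the first two variables against the last two, and computes the matrix rank exactly over the rationals. The marginals and helper names are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product

def matrix_rank(rows):
    """Rank over the rationals by Gaussian elimination."""
    a = [[Fraction(v) for v in r] for r in rows]
    rank = 0
    for col in range(len(a[0])):
        pivot = next((r for r in range(rank, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, len(a)):
            f = a[r][col] / a[rank][col]
            a[r] = [x - f * y for x, y in zip(a[r], a[rank])]
        rank += 1
    return rank

def product_dist(marginals):
    """marginals[j] = P(x_j = 1); returns a dict mapping states to probabilities."""
    dist = {}
    for x in product((0, 1), repeat=len(marginals)):
        prob = Fraction(1)
        for xj, pj in zip(x, marginals):
            prob *= pj if xj == 1 else 1 - pj
        dist[x] = prob
    return dist

def flatten(p, row_vars, col_vars):
    """Arrange the table p as a matrix: rows index row_vars, columns index col_vars."""
    n = len(row_vars) + len(col_vars)
    matrix = []
    for r in product((0, 1), repeat=len(row_vars)):
        row = []
        for c in product((0, 1), repeat=len(col_vars)):
            x = [0] * n
            for i, v in zip(row_vars, r):
                x[i] = v
            for i, v in zip(col_vars, c):
                x[i] = v
            row.append(p[tuple(x)])
        matrix.append(row)
    return matrix

q1 = product_dist([Fraction(1, 4)] * 4)
q2 = product_dist([Fraction(3, 4)] * 4)
mixture = {x: (q1[x] + q2[x]) / 2 for x in q1}

rank_product = matrix_rank(flatten(q1, [0, 1], [2, 3]))
rank_mixture = matrix_rank(flatten(mixture, [0, 1], [2, 3]))
```

The product distribution flattens to a rank-one matrix, and the mixture of two products to a matrix of rank two, in line with the bound by the number of mixture components.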

Theorem 32. If $m \le n/2$, then the model $\mathrm{RBM}_{n,m}$ contains points of rank $2^m$. If
furthermore $m + 1 \ne 3$ or $n \ne 4$, then the models $\mathrm{RBM}_{n,m}$ and $\mathcal{M}_{n,m+1}$ both have dimension
$nm + n + m$ and intersect at a set of dimension strictly less than $nm + n + m$.

Proof. We show that if $m \le n/2$, then $\mathrm{RBM}_{n,m}$ contains a point $p$ with a flattening of
rank $2^m$, which implies that $p$ has outer-product rank $2^m$. The flattenings of any $q \in \mathcal{M}_{n,k}$
have rank at most $k$. This gives an algebraic invariant of the mixture of products model
$\mathcal{M}_{n,m+1}$ which is not satisfied by elements of $\mathrm{RBM}_{n,m}$. Hence, if both models have the
same dimension $d$, then they intersect at a set of dimension strictly less than $d$.

Consider the $m$-cube and the $2m$ hyperplanes through its center consisting of translates of
the coordinate hyperplanes with multiplicity two. This hyperplane arrangement slices each
edge of the $m$-cube exactly twice and generates a $(2m, m)$-LTC $\mathcal{C}$ of minimum distance two.
The code $\mathcal{C}$ consists of the $2^m$ binary vectors $x$ in $\{0,1\}^{2m}$ with $x_i = x_{i+1}$ for all odd $i$, and
$\{(x_1, x_3, \ldots, x_{2m-1}) : x \in \mathcal{C}\} = \{0,1\}^m$. In the case $m = 3$, for example, the code is

$$\mathcal{C} = \{000000,\ 000011,\ 001100,\ 001111,\ 110000,\ 110011,\ 111100,\ 111111\}. \tag{7}$$

By Theorem 23, $\mathrm{RBM}_{2m,m}$ contains the uniform distribution on $\mathcal{C}$, $u_{\mathcal{C}}$. View $u_{\mathcal{C}}$ as a linear
transformation from the $2^m$-dimensional space of real valued functions on $x_1, x_3, \ldots, x_{2m-1}$
to the space of functions on $x_2, x_4, \ldots, x_{2m}$, then

$$u_{\mathcal{C}} = \begin{pmatrix} 1/2^m & & \\ & \ddots & \\ & & 1/2^m \end{pmatrix}, \tag{8}$$

which has rank $2^m$. □
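The flattening used in this proof can be written out explicitly. The sketch below builds the code of binary vectors of length $2m$ whose coordinates agree in consecutive pairs, forms the uniform distribution on it, and flattens by grouping the odd-indexed coordinates against the even-indexed ones; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

def doubled_code(m):
    """Binary vectors of length 2m with x_1 = x_2, x_3 = x_4, and so on."""
    return {tuple(b for bit in h for b in (bit, bit)) for h in product((0, 1), repeat=m)}

def flattening_of_uniform(m):
    """Matrix of the uniform distribution on the code: rows index (x_1, x_3, ...),
    columns index (x_2, x_4, ...)."""
    code = doubled_code(m)
    weight = Fraction(1, len(code))
    states = list(product((0, 1), repeat=m))
    matrix = []
    for odd in states:
        row = []
        for even in states:
            # Interleave the two groups back into a vector of length 2m.
            x = tuple(b for pair in zip(odd, even) for b in pair)
            row.append(weight if x in code else Fraction(0))
        matrix.append(row)
    return matrix

m = 3
code = doubled_code(m)
matrix = flattening_of_uniform(m)
min_distance = min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(code, 2))
is_scaled_identity = all(
    matrix[i][j] == (Fraction(1, 2 ** m) if i == j else 0)
    for i in range(2 ** m) for j in range(2 ** m)
)
```

Because the flattening is $2^{-m}$ times the identity matrix, its rank is $2^m$ without any further computation.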
7 Discussion


Our results show that RBMs can represent distributions with many strong modes much 
more compactly than standard mixture models. This gives a concise, combinatorial way of 
differentiating the two models. 

RBMs create a multi-labeling of their input space by the most likely joint states of their
hidden units given the inputs. The number of inference regions that can be generated in this
way is of exponential order in the number of RBM parameters. The partitions of $\mathbb{R}^n$ generated
by an RBM with $n$ visible and $m$ hidden units can be identified with the intersections
of affine spaces of dimension $d \le \min\{n, m\}$ with the orthants of $\mathbb{R}^m$, where there is one
affine space for each choice of the RBM parameters. We elaborated on the combinatorics
of the resulting hyperplane arrangements, and on the combinatorics of point configurations
in such hyperplane arrangements, in correspondence with the inference functions on the set
of binary input vectors $\{0,1\}^n \subset \mathbb{R}^n$. The theory of hyperplane arrangements and linear
separation of points is well studied but still poses many questions.

We analyzed the sets of strong modes of probability distributions represented by RBMs and
related them to the hyperplane arrangements and linear threshold codes (multi-labelings).
The products of mixtures represented by RBMs are compact representations of probability
distributions which have many strong modes; namely of order $\min\{2^m, 2^{n-1}\}$ for the RBM
with $n$ visible and $m$ hidden binary units, and thus of exponential order in the number
of parameters. Mixture models of product distributions (naive Bayes models), in contrast,
generate less restricted input space partitions but into at most as many regions as mixture
components, and can only represent probability distributions with a number of strong modes
that is linear in the number of model parameters.

In particular, the smallest mixture model of product distributions that contains an RBM
model is as large as one can possibly expect, having one mixture component per state
of the RBM hidden units, and thus a number of parameters that is exponential in the
number of RBM parameters. Fixing dimension, the RBMs which are hardest to represent
as mixtures of product distributions are those with about the same number of visible and
hidden units. These results aid our understanding of how models complement each other,
and why distributed representations in deep learning [2] can be expected to succeed, or
when model selection can be based on theory rather than trial and error. They confirm
the intuition that distributed representations are exponentially more powerful than
non-distributed ones, in the case of binary RBMs and taking the number of strong modes as a
measure of complexity.

Other measures of complexity of probability distributions such as multi-information, defined 
as the Kullback-Leibler divergence to the set of product distributions, are interesting, but not 
necessarily best for differentiating between MoPs and RBMs. In terms of multi-information 
the most complex binary probability distributions have the form $p = \frac{1}{2}(\delta_x + \delta_y)$ with $x_i + y_i =
1$ for all $i$, see [1], and are contained in any (non-trivial) MoP and RBM models.
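As a small numerical illustration of the multi-information of such distributions, the sketch below uses the standard identity that the Kullback-Leibler divergence to the set of product distributions equals the sum of the marginal entropies minus the joint entropy. The helper name and the choice $n = 4$ are illustrative.

```python
from itertools import product
from math import log2

def multi_information(p, n):
    """Sum of marginal entropies minus joint entropy, in bits; p maps states to probabilities."""
    def entropy(probs):
        return -sum(q * log2(q) for q in probs if q > 0)
    joint = entropy(p.values())
    marginals = 0.0
    for j in range(n):
        p1 = sum(q for x, q in p.items() if x[j] == 1)
        marginals += entropy([p1, 1 - p1])
    return marginals - joint

n = 4
x = (0, 1, 1, 0)
y = tuple(1 - xi for xi in x)          # the coordinate-wise complement of x
p = {x: 0.5, y: 0.5}
value = multi_information(p, n)

uniform = {s: 1 / 2 ** n for s in product((0, 1), repeat=n)}
value_uniform = multi_information(uniform, n)
```

For a uniform mixture of two complementary point measures on $n$ bits, each marginal carries one bit and the joint carries one bit, so the multi-information is $n - 1$ bits, whereas the uniform distribution has multi-information zero.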

We note that there may exist small mixtures of product distributions which cannot be
compactly represented by RBMs. We showed that $\mathcal{M}_{n,m+1} \not\subseteq \mathrm{RBM}_{n,m}$ when $3 \le m \le n/2$.



Acknowledgments 



The authors are grateful to Johannes Rauh for helpful discussions on algebraic invariants,
to Nihat Ay for comments on hyperplane arrangements, and to Yoshua Bengio for helpful
discussions regarding distributed representations. The authors acknowledge use of the
RCC-ITS computer cluster at the Pennsylvania State University. This work is supported in part
by DARPA grant FA8650-11-1-7145.






Appendix 



A Proofs 

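Proof of Proposition. Let $\mathcal{X} := \{0,1\}^n$, $\mathcal{P} = \mathcal{P}(\mathcal{X})$, and let $\mathcal{P}(\mathcal{Y})$ be the simplex of
probability distributions strictly supported on $\mathcal{Y} \subseteq \mathcal{X}$. This is a regular $(|\mathcal{Y}| - 1)$-simplex in
$\mathbb{R}^{|\mathcal{Y}|}$; all edges have the same length $\sqrt{2}$. Let $\mathcal{H}(\mathcal{Y})$ denote the set of distributions which have
strong modes at $\mathcal{Y}$. Then $\mathcal{H}(\mathcal{Y}) = \bigcap_{y \in \mathcal{Y}} \mathcal{H}(y)$. For any $y \in \mathcal{X}$, denote by $B_1(y)$ the Hamming
ball with center $y$ and radius 1. The set $\mathcal{P}_y(B_1(y)) := \{p \in \mathcal{P}(B_1(y)) : p(y) \ge p(\mathcal{X} \setminus \{y\})\}$ is
a regular $n$-simplex of side length $1/\sqrt{2}$ (its vertices are $\{\frac{1}{2}(\delta_y + \delta_{\hat{y}})\}_{d_H(\hat{y}, y) \le 1}$). The volume
of a regular $N$-simplex with side length $l$ is $\frac{l^N}{N!}\sqrt{\frac{N+1}{2^N}}$. The set $\mathcal{H}(y)$ is the convex hull
of $\mathcal{P}_y(B_1(y))$ and $\mathcal{P}(\mathcal{X} \setminus B_1(y))$, and hence $\mathrm{vol}(\mathcal{H}(y)) = 2^{-n}\,\mathrm{vol}(\mathcal{P})$. If $\mathcal{Y}$ has minimum
distance 3 or more, $\mathrm{vol}(\mathcal{H}(\mathcal{Y})) = 2^{-|\mathcal{Y}| n}\,\mathrm{vol}(\mathcal{P})$. If the minimum distance of $\mathcal{Y}$ is two, then
the volume of $\mathcal{H}(\mathcal{Y})$ is larger.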

The number K(m+\) is a lower bound on the number of disjoint sets T~L{y) with \y\ = m+1. 
The Gilbert- Varshamov bound tells us that if m + 1 < 2 fc , where A: is the largest integer 
for which 2 k < there exists a binary code y C X, \y\ = m + 1 with minimum distance 
3. Let y = \ {?/}) U {y ®2 ^i} (flip one coordinate of one element of y). We have that 
T-i{y) fl H(y') = 0. Since y has (m + 1) elements, there are 2 m+1 disjoint sets H. For 
any m + 1 < 2 n_1 , if 3^ is a binary code of minimum distance 2, then also 3^ ®2 ei, and 

W) n = 0. ! □ 

Proof of Proposition 8. Assume that $\mathcal{M}_{3,3} \cap \mathcal{G}_3^+ \ne \emptyset$. By Lemma 33 there exist factors
$(p^{i,1}, p^{i,2}, p^{i,3}) \in (\mathcal{P}_1)^3$, $i = 1, 2, 3$, such that $\mathrm{conv}\{q^i := p^{i,2} p^{i,3}\}_{i=1,2,3}$ intersects $\mathcal{G}_2^+$ and $\mathcal{G}_2^-$,
and intersects the boundary of each of them exactly once. See Figure 5. Hence $\mathrm{conv}\{q^1, q^2\}$
intersects $\mathcal{G}_2^+$, and $\mathrm{conv}\{q^2, q^3\}$ intersects $\mathcal{G}_2^-$ (for some enumeration of $q^1, q^2, q^3$). The
mixture of $q^2$ and $q^3$ intersects $\mathcal{G}_2^-$ only if $\{(01), (10)\}$ are the maxima of $q^2$ and $q^3$. Similarly,
if $\mathrm{conv}\{q^1, q^2\}$ intersects $\mathcal{G}_2^+$, then $\{(11), (00)\}$ are the maxima of $q^1$ and $q^2$; a contradiction.

□

The proof of Proposition 8 makes use of the following lemma, which relates the number of
modes realizable by $\mathcal{M}_{n,k}$ to the number of modes simultaneously realizable on subsets of
variables.

Lemma 33. Let $n, k \in \mathbb{N}$ and $n \ge 2$. Let $p = \sum_{i \in [k]} \lambda_i \prod_{j \in [n]} p^{i,j} \in \mathcal{M}_{n,k}$, with $\lambda_i \ge 0$
and $p^{i,j} \in \mathcal{P}_1$ for all $(i,j) \in [k] \times [n]$. If $p$ has $2^{n-1}$ modes, then for any subset of variables
$I \subseteq [n]$, $|I| = m$, the convex hull of the product distributions $\{\prod_{j \in I} p^{i,j}\}_{i \in [k]} \subseteq \mathcal{P}_m$ intersects
both $\mathcal{G}_m^+$ and $\mathcal{G}_m^-$.

Proof. We show the special case with $I = [m] = \{1, \ldots, n-1\}$. The proof of the general
case is a straightforward generalization of the proof of this special case. Any $q \in \mathcal{M}_{n,k}$ has
the following form:

$$q(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{k} \lambda_i\, p^{i,1}(x_1)\, p^{i,2}(x_2) \cdots p^{i,n}(x_n),$$

for all $(x_1, x_2, \ldots, x_n) \in \{0,1\}^n$, where $\sum_{i=1}^{k} \lambda_i = 1$, $\lambda_i \ge 0$ and $p^{i,j} \in \mathcal{P}_1$. For the fixed
value $x_1 = 0$ this can be understood as a mixture of $k$ products with $(n-1)$ variables,
multiplied by a positive constant:

$$q(x_1 = 0, x_2, \ldots, x_n) = c \underbrace{\sum_{i=1}^{k} \lambda_{0,i}\, p^{i,2}(x_2) \cdots p^{i,n}(x_n)}_{q_0(x_2, \ldots, x_n)},$$

where $\sum_{i=1}^{k} \lambda_{0,i} = 1$, $\lambda_{0,i} \ge 0$, with

$$\lambda_{0,i} = \frac{\lambda_i\, p^{i,1}(x_1 = 0)}{\sum_{j} \lambda_j\, p^{j,1}(x_1 = 0)}.$$

A similar observation can be made for the fixed value $x_1 = 1$. If the distribution $q$ is
contained in $\mathcal{G}_n^+$, then a) $q_0 \in \mathcal{G}_{n-1}^+$, which is to say that $q$ satisfies the inequalities describing
$\mathcal{G}_n^+$ involving coordinates from the set $\{(0, x_2, \ldots, x_n) : (x_2, \ldots, x_n) \in \{0,1\}^{n-1}\}$; b) $q_1 \in
\mathcal{G}_{n-1}^-$, which is to say that $q$ satisfies the inequalities describing $\mathcal{G}_n^+$ involving coordinates
from the set $\{(1, x_2, \ldots, x_n) : (x_2, \ldots, x_n) \in \{0,1\}^{n-1}\}$. The distributions $q_0$ and $q_1$ are
mixtures of the same $k$ product distributions $\{p^{i,2} \cdots p^{i,n}\}_{i \in [k]}$, although they may have
different mixture weights. □



Proof of Proposition 17. Let $Z$ be a candidate zonoset; $Z$ has $(n-1)$ generators, so it lies in
an affine hyperplane $H$ of $\mathbb{R}^n$. Let $\eta$ be a normal vector to $H$. Assume first that $0 \in H$. All
vectors in the orthants $\mathrm{sgn}(\eta)$ and $-\mathrm{sgn}(\eta)$ lie outside $H$ (where we may assign arbitrary
sign to zero entries of $\eta$). This follows from Stiemke's theorem, see, e.g., [9]. The two
orthants have opposite sign vectors and $n$ is odd, so one orthant has even parity and the
other, odd. So at least one even and one odd orthant don't intersect $Z$. Consider now an
affine subspace $H$, and assume it intersects all even orthants. By eq. Q $\dim(H) \ge n - 1$,
so $H$ is a hyperplane. Assume without loss of generality that a normal vector to $H$ has only
negative entries. Then $H \cap \mathbb{R}^n_{-\cdots-}$ is an $(n-1)$-dimensional simplex containing a point of
$Z$. This can be inferred from the number of bounded cells in a $d$-dimensional arrangement
of $n$ hyperplanes in general position: $b(n,d) = \binom{n-1}{d}$, [EJ Proposition 2.4]. The orthant
$\mathbb{R}^n_{-\cdots-}$ is separated by $(n-1)$ coordinate hyperplanes from the orthant $\mathbb{R}^n_{s_i}$ with sign
$s_i = (+ \cdots + - + \cdots +)$ for any $i \in [n]$. Since $n$ is odd and larger than one, $(n-1) > 0$ is
even. As $Z$ intersects $H \cap \mathbb{R}^n_{s_i}$ for all $i \in [n]$, $H \cap \mathbb{R}^n_{-\cdots-} \subseteq \mathrm{conv}(Z)$ (details in Lemma 34).
On the other hand, the $(n-1)$-generated zonotope of dimension $(n-1)$ is combinatorially
equivalent to the $(n-1)$-cube, and no point in its zonoset is contained in the convex hull
of any other points. □



We used the following lemma in the proof of Proposition 17.

Lemma 34. Let $P$ be a polytope. Let $v$ be a vertex of $P$ and $\{H_i\}_i$ the supporting
hyperplanes of the facets of $P$ incident to $v$, and assume $P$ is contained in the intersection
of closed half-spaces $\bigcap_i H_i^-$. If $v'$ is any point in $\bigcap_i H_i^+$, then the polytope $\mathrm{conv}(\{v'\} \cup (V \setminus v))$
contains $P$.

Proof. The case $v' = v$ is trivial, so let $v' \ne v$. It is sufficient to show that $v$ is not a vertex
of $Q := \mathrm{conv}(\{v'\} \cup V)$, from which $v \in \mathrm{conv}(\{v'\} \cup (V \setminus v))$ and $P \subseteq \mathrm{conv}(\{v'\} \cup (V \setminus v))$
follows. The point $v'$ is a vertex of $Q$, because $v' \notin P$. Consider first the case where $v'$ is in
the interior of $\bigcap_i H_i^+$, which is to say that $v'$ is not contained in any $H_i^-$. If $v$ was a vertex
of $Q$, then one $H_i$ would support a facet of $Q$ (otherwise $v'$ would be incident to all facets
incident to $v$). This would contradict the fact that $v' \notin H_i^-$. The general case $v' \in \bigcap_i H_i^+$
results from continuity. □



For completeness we give a proof of the following: 

Observation 35. Suppose a linear threshold function has an equal number of positive and
negative points. Then it separates every point of the $m$-cube and its opposite, and is self-dual.

Proof. Each linear threshold function is defined by a hyperplane $H(v,b)$ normal to a vector
$v \in \mathbb{R}^m$ shifted by $b \in \mathbb{R}$ and has positive points on the $m$-cube $C_+(v,b) = \{x \in \{-1,1\}^m :
v \cdot x + b > 0\}$. We may assume that $H(v,0)$ does not intersect any points of $\{-1,1\}^m$; if
it did, we could perturb it slightly without changing $C_+(v,b)$. Since $H(v,0)$ passes through
the origin, $C_+(v,0)$ contains exactly $2^{m-1}$ points, and if $x \in C_+(v,0)$ then its complement
$\bar{x} \in C_-(v,0)$. Since $C_+(v,0) \subseteq C_+(v,b)$, if $C_+(v,b)$ contains any pair $x, \bar{x}$, it must contain
strictly more than $2^{m-1}$ points. □
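Observation 35 can be sanity-checked by brute force on a small cube. The sketch below only scans a finite grid of integer normal vectors and half-integer thresholds for $m = 3$, so it is an illustration rather than a proof; the grid bounds and function names are arbitrary choices.

```python
from itertools import product

def positive_points(v, b, m):
    """Points x of {-1,1}^m with v.x + b > 0."""
    return {x for x in product((-1, 1), repeat=m)
            if sum(vi * xi for vi, xi in zip(v, x)) + b > 0}

def check_balanced_are_self_dual(m=3, weight_range=range(-3, 4)):
    """Scan a grid of (v, b); return (all balanced cases separate antipodes, cases checked)."""
    thresholds = [k / 2 for k in range(-12, 13)]
    checked = 0
    for v in product(weight_range, repeat=m):
        for b in thresholds:
            pos = positive_points(v, b, m)
            if len(pos) == 2 ** (m - 1):
                # Balanced: no positive point may have its antipode on the same side.
                if any(tuple(-xi for xi in x) in pos for x in pos):
                    return False, checked
                checked += 1
    return True, checked

all_self_dual, balanced_cases = check_balanced_are_self_dual()
```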
B Multi-covering numbers of hypercubes

Remark 28 and Theorem 29 show that the smallest $k$ for which $\mathcal{M}_{n,k} \supseteq \mathrm{RBM}_{n,m}$ is
intimately related to the solution of the following problem:

Problem 36. Let $m < n$. Consider an $m$-generated zonoset $Z$ in $\mathbb{R}^n$ which does not
intersect any two orthants separated by a single coordinate hyperplane. How many orthants
of $\mathbb{R}^n$ does $Z$ intersect at most?



This problem is related to the long standing problem of computing the covering numbers of 
hypercubes, as we discuss in the following. 

Definition 37. The covering number of a hypercube is the smallest number of hyperplanes 
that slice each edge of the hypercube at least once. An edge is sliced by a hyperplane if 
the hyperplane intersects the relative interior of the edge, and does not contain any vertices 
of the hypercube. A cut is the collection of all edges sliced by some hyperplane. Each cut 
corresponds to a linear threshold function. 

For any m, the m hyperplanes with normal vectors equal to the standard basis of R m passing 
through the center of the m-dimensional hypercube slice all its edges. This arrangement is 
not always optimal. Paterson found 5 hyperplanes slicing all edges of the 6-cube, see [27] . 
This shows that covering numbers do not behave trivially. The covering numbers are known
only for hypercubes of dimension $\le 6$. Computing them in higher dimensions is challenging,
even in the cases where all cuts are known. Now:

Proposition 38. If $Z_{+,n}$ is an $(n, n-1)$-LTC, then there exists an arrangement of $n$
hyperplanes through the center of the $(n-1)$-cube slicing each edge an even nonzero number
of times.

Proof. Given the assumption, there exists an arrangement of $n$ hyperplanes in $\mathbb{R}^{n-1}$ such
that each vertex of the $(n-1)$-cube is in one of the $Z_{+,n}$-cells of the arrangement. Each
vertex is separated by an even, positive number of hyperplanes from any other vertex, since
any two elements of $Z_{+,n}$ differ in an even number of entries. Two vertices may not be
contained in the same cell of the arrangement, since $|Z_{+,n}| = 2^{n-1}$ equals the total number
of vertices of the $(n-1)$-cube. The code $Z_{+,n}$ is homogeneous, as each coordinate $i \in [n]$
has the same number of zeros and ones, and each hyperplane in the arrangement can be
chosen through the center of the cube (Observation 35). □

Proposition 38 motivates the following version of the covering number problem:

Problem 39 (Multi-covering number). What is the smallest arrangement of hyperplanes,
if one exists, that slices each edge of a hypercube a specific number of times?

Of particular interest is the number of hyperplanes needed to slice each edge of the $m$-cube
an even nonzero number of times. The edges of the $m$-cube can be sliced exactly twice by $2m$
hyperplanes with normal vectors the standard basis vectors of $\mathbb{R}^m$ counted with multiplicity
two. Proposition 17 shows that if $m$ is even and larger than zero, there is no arrangement
of $(m+1)$ hyperplanes for which each vertex of the $m$-cube lies in a different cell and any
two vertices are separated by an even number of hyperplanes. This suggests that when $m$ is
even, there is no arrangement of $(m+1)$ hyperplanes slicing all edges of the $m$-cube exactly
twice; at least not one for which each vertex lies in a different cell. There is exactly one
way to slice all edges of the 3-cube an even, nonzero number of times by four hyperplanes,
namely the way illustrated in Figure 6. To see that this is the only way, note that the 3-cube
has twelve edges and that there are only four different cuts that slice six edges. The 4-cube
has 16 vertices, 32 edges, a total of 940 different cuts, 3 symmetry classes of central cuts,
and 52 different central cuts. The maximal number of edges sliced by a cut is 12. Hence:

Proposition 40. There is no arrangement of five or fewer hyperplanes slicing each edge of
the four-dimensional cube at least twice.
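The counting behind Proposition 40 can be spelled out in a few lines. The sketch below takes the maximal number of edges per cut of the 4-cube (12) and the number of six-edge cuts of the 3-cube (4) from the text as given inputs; only the edge-count formula $m \cdot 2^{m-1}$ is computed, and the names are illustrative.

```python
def num_edges(m):
    """Number of edges of the m-cube: m directions, 2^(m-1) edges per direction."""
    return m * 2 ** (m - 1)

# 4-cube: slicing every edge at least twice needs 2 * 32 = 64 slices in total.
MAX_EDGES_PER_CUT_4CUBE = 12          # value quoted in the text
required_4 = 2 * num_edges(4)
available_4 = 5 * MAX_EDGES_PER_CUT_4CUBE

# 3-cube: the four cuts that slice six edges each give exactly 2 * 12 slices.
required_3 = 2 * num_edges(3)
available_3 = 4 * 6
```

Five cuts of the 4-cube provide at most 60 slices, fewer than the 64 required, whereas for the 3-cube the four six-edge cuts meet the requirement with no slack.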

The complexity of the next easiest example is considerable. We tested all combinations of 
six cuts of the 5-cube and found: 
Computation 41. There does not exist an arrangement of six or fewer central hyperplanes
slicing all edges of the five-dimensional cube an even nonzero number of times.

Special instances of the statement are much easier to compute, for example: There is no
arrangement of six central hyperplanes which slices each edge of the five-dimensional cube
exactly twice. In the following we explain some details of the computation. The 5-cube has
80 edges. There are 47285 different ways of slicing them with affine hyperplanes, see [8]. A
cut is given by the indicator function on the set of edges sliced. A list of the cuts can be
found in [35]. An edge of the $m$-cube corresponds to a pair of binary vectors of length $m$,
which differ in exactly one entry. Each edge is parallel to one coordinate vector of $\mathbb{R}^m$. The
edges can be organized in $m$ groups, corresponding to their directions. Within each group
the edges are naturally enumerated by the binary vectors of length $(m-1)$ containing the
coordinate values that are equal for the two vertices of each edge. The central cuts can
be characterized as the cuts which involve only pairs of opposite edges. The 5-cube allows
7 symmetry classes of central cuts and 941 different central cuts. For each choice of 6 or
fewer central cuts we computed the entry-wise addition of the indicator functions and found
that this never produced an even, nonzero value in each entry. On the other hand, 5 is the
covering number of the 5-cube, see [8], and hence at least 6 hyperplanes are needed to slice
each edge twice. As a consequence of Computation 41,

Corollary 42. The model $\mathrm{RBM}_{6,5}$ cannot represent probability distributions with 32 strong
modes.

Indeed, we trained $\mathrm{RBM}_{6,5}$ to approximate the uniform probability distribution on $Z_{+,6}$ and
found a Kullback-Leibler divergence minimum of 0.6309 (with base-two logarithm), which
is a relatively large value. For this numerical experiment we used contrastive divergence [14]
and maximum likelihood methods with extensive parameter initializations.
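The edge bookkeeping described for Computation 41 can be sketched as follows. This is not the authors' program: it only reproduces the grouping of edges by direction and the entry-wise addition of indicator functions, and it uses the coordinate hyperplanes through the center (each taken twice) as a stand-in family of central cuts, since the full list of cuts is not reproduced here.

```python
from itertools import product

def edges_by_direction(m):
    """Group the edges of the m-cube by direction; within a group, an edge is labelled
    by the (m-1) coordinate values shared by its two endpoints."""
    return {d: list(product((0, 1), repeat=m - 1)) for d in range(m)}

def coordinate_cut(m, d):
    """Indicator function of the edges sliced by the central hyperplane x_d = 1/2:
    exactly the edges in direction d."""
    return {(g, rest): int(g == d)
            for g, rests in edges_by_direction(m).items() for rest in rests}

def add_cuts(cuts):
    """Entry-wise sum of indicator functions: how often each edge is sliced."""
    total = {}
    for cut in cuts:
        for edge, value in cut.items():
            total[edge] = total.get(edge, 0) + value
    return total

m = 5
edge_count = sum(len(group) for group in edges_by_direction(m).values())
single = add_cuts([coordinate_cut(m, d) for d in range(m)])
doubled = add_cuts([coordinate_cut(m, d) for d in range(m)] * 2)
```

With each of the five coordinate cuts taken once, every edge is sliced once; taking each twice uses ten hyperplanes and slices every edge exactly twice, which is the $2m$ construction mentioned above.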